Objective Soups: Multilingual Multi-Task Modeling for Speech Processing

Neural Information Processing Systems

The need for training multilingual multi-task speech processing (MSP) models that perform both automatic speech recognition and speech-to-text translation is increasingly evident. However, a significant challenge arises from the conflicts among multiple objectives when using a single model. Multi-objective optimization can address this challenge by facilitating the optimization of multiple conflicting objectives and aligning the gradient updates in a common descent direction. While multi-objective optimization helps avoid conflicting gradient updates, a critical issue is that when there are many objectives, such as in MSP, it is often difficult to find a common descent direction. This leads to an important question: Is it more effective to separate highly conflicting objectives into different optimization levels or to keep them in a single level? To address this question, this paper investigates three multi-objective MSP formulations, which we refer to as objective soup recipes. These formulations apply multi-objective optimization at different optimization levels to mitigate potential conflicts among all objectives. To keep computation and memory overhead low, we incorporate a lightweight layer selection strategy that detects the most conflicting layers and uses only their gradients when computing the conflict avoidance direction. We conduct an extensive investigation using the CoVoST v2 dataset for combined multilingual ASR and ST tasks, along with the LibriSpeech and AISHELL-1 datasets for multilingual ASR, to identify highly conflicting objectives and determine the most effective training recipe among the three proposed multi-objective optimization algorithms.
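As a rough illustration of the two ideas named in the abstract (a common descent direction, and selecting the most conflicting layers), the sketch below uses the well-known closed-form min-norm combination of two gradients and a per-layer cosine-similarity score. This is not the paper's algorithm, which the abstract does not specify; the function names, the cosine-based conflict score, and the toy gradients are all assumptions made for illustration.

```python
from math import sqrt


def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def min_norm_direction(g1, g2):
    """Min-norm point in the convex hull of two gradients.

    Hypothetical helper: the standard two-objective closed form.
    If the result d is nonzero, it satisfies d.g1 > 0 and d.g2 > 0,
    so stepping along -d does not increase either loss to first order.
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        return list(g1)  # identical gradients, nothing to reconcile
    # Minimize ||alpha*g1 + (1-alpha)*g2||^2 over alpha in [0, 1].
    alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    alpha = min(1.0, max(0.0, alpha))
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is zero."""
    nu, nv = sqrt(dot(u, u)), sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu > 0 and nv > 0 else 0.0


def most_conflicting_layers(grads_a, grads_b, k):
    """Names of the k layers whose two task gradients disagree most.

    Assumption: conflict is scored by cosine similarity (lower means
    more conflicting). grads_a and grads_b map layer name -> flat gradient.
    """
    scores = {name: cosine(grads_a[name], grads_b[name]) for name in grads_a}
    return sorted(scores, key=scores.get)[:k]


# Toy usage with made-up gradients for two objectives.
g_asr = [1.0, 0.0]
g_st = [-0.5, 1.0]
d = min_norm_direction(g_asr, g_st)

layer_grads_asr = {"encoder": [1.0, 0.0], "decoder": [1.0, 1.0]}
layer_grads_st = {"encoder": [-1.0, 0.1], "decoder": [1.0, 0.9]}
selected = most_conflicting_layers(layer_grads_asr, layer_grads_st, k=1)
```

In the toy example the two gradients have a negative inner product, yet the combined direction `d` has a positive inner product with each of them. Restricting the computation to the layers returned by `most_conflicting_layers` is one plausible way to reduce cost, in the spirit of the layer selection strategy the abstract mentions.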






A Proofs

Neural Information Processing Systems

Lemma 1. Assume that Assumptions 1 and 2 hold. Then the iterations satisfy the following inequality for all k ∈ ℕ: Combining Assumption 2 with Definition 4.6, we have the second moment of g(W Summing both sides of this inequality for k ∈ {1,...,K} and recalling Assumption 2(a) gives Rearranging the above inequality and dividing further by K yields the result. The second condition in Eq. 4.10 ensures that lim Summing both sides of this inequality for k ∈ {1,...,K} and recalling Assumption 2(a) gives It guarantees that the model moves in a descent direction of the loss function. Following the experimental setup in Section 5.1, we demonstrate that the proposed method empirically satisfies Assumption 2(b), and visualize in Figure 7 the sufficient direction constant µ for the (partial) convolutional layers of the four models during end-to-end training using TREC. For SqueezeNet and ResNet-34, we show one block as the representative, since the other blocks behave similarly. Several insights can be drawn from Figure 7. (i) The value of µ for each convolutional layer is consistently greater than zero, indicating that Assumption 2(b) is satisfied, which further ensures the convergence of the TREC-equipped CNNs.



96f2d6069db8ad895c34e2285d25c0ed-Supplemental.pdf

Neural Information Processing Systems

Smooth convex optimization problems over polytopes are an important class of problems that appear in many settings, such as low-rank matrix completion [1], structured supervised learning [2,3], electrical flows over graphs [4], video co-localization in computer vision [5], traffic assignment problems [6], and submodular function minimization [7].